Overview for the Second Shared Task on Language Identification in Code-Switched Data
Abstract
We present an overview of the second shared task on language identification in code-switched data. For the shared task, we had code-switched data from two different language pairs: Modern Standard Arabic-Dialectal Arabic (MSA-DA) and Spanish-English (SPA-ENG). We had a total of nine participating teams, with all teams submitting a system for SPA-ENG and four submitting for MSA-DA. Through evaluation, we found that, once again, language identification is more difficult for the language pair that is more closely related. We also found that this year’s systems performed better overall than the systems from the previous shared task, indicating overall progress in the state of the art for this task.
Related resources
Overview for the First Shared Task on Language Identification in Code-Switched Data
We present an overview of the first shared task on language identification on code-switched data. The shared task included code-switched data from four language pairs: Modern Standard Arabic-Dialectal Arabic (MSA-DA), Mandarin-English (MAN-EN), Nepali-English (NEP-EN), and Spanish-English (SPA-EN). A total of seven teams participated in the task and submitted 42 system runs. The evaluation showed t...
A Neural Model for Language Identification in Code-Switched Tweets
Language identification systems suffer when working with short texts or in domains with unconventional spelling, such as Twitter or other social media. These challenges are explored in a shared task for Language Identification in Code-Switched Data (LICS 2016). We apply a hierarchical neural model to this task, learning character and contextualized word-level representations to make word-level ...
The CMU Submission for the Shared Task on Language Identification in Code-Switched Data
We describe the CMU submission for the 2014 shared task on language identification in code-switched data. We participated in all four language pairs: Spanish–English, Mandarin–English, Nepali–English, and Modern Standard Arabic–Arabic dialects. After describing our CRF-based baseline system, we discuss three extensions for learning from unlabeled data: semi-supervised learning, word embeddings,...
Language Identification in Code-Switched Text Using Conditional Random Fields and Babelnet
The paper outlines a supervised approach to language identification in code-switched data, framing this as a sequence labeling task where the label of each token is identified using a classifier based on Conditional Random Fields and trained on a range of different features, extracted both from the training data and by using information from Babelnet and Babelfy. The method was tested on the de...
DCU-UVT: Word-Level Language Classification with Code-Mixed Data
This paper describes the DCU-UVT team’s participation in the Language Identification in Code-Switched Data shared task in the Workshop on Computational Approaches to Code Switching. Word-level classification experiments were carried out using a simple dictionary-based method, linear kernel support vector machines (SVMs) with and without contextual clues, and a k-nearest neighbour approach. Based...